The textbook for the Data Science course series is freely available online.
There are three major sections in this course: introduction to linear regression, linear models, and confounding.
In this section, you’ll learn the basics of linear regression through this course’s motivating example, the data-driven approach used to construct baseball teams. You’ll also learn about correlation, the correlation coefficient, stratification, and the variance explained.
In this section, you’ll learn about linear models. You’ll learn about least squares estimates, multivariate regression, and several useful features of R, such as tibbles, lm, do, and broom. You’ll learn how to apply regression to baseball to build a better offensive metric.
In the final section of the course, you’ll learn about confounding and several reasons that correlation is not the same as causation, such as spurious correlation, outliers, reversing cause and effect, and confounders. You’ll also learn about Simpson’s Paradox.
In the Introduction to Regression section, you will learn the basics of linear regression.
After completing this section, you will be able to:
This section has three parts: Baseball as a Motivating Example, Correlation, and Stratification and Variance Explained.
The corresponding section of the textbook is the case study on Moneyball
Key points
Bill James was the originator of sabermetrics, the approach of using data to determine which outcomes best predict whether a team wins.
The corresponding section of the textbook is the section on baseball basics
Key points
The corresponding section of the textbook is the base on balls or stolen bases textbook section
Key points
The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot.
Code: Scatterplot of the relationship between HRs and runs
if(!require(Lahman)) install.packages("Lahman")
## Loading required package: Lahman
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(dslabs)) install.packages("dslabs")
## Loading required package: dslabs
library(Lahman)
library(tidyverse)
library(dslabs)
ds_theme_set()
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Code: Scatterplot of the relationship between stolen bases and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
ggplot(aes(SB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Code: Scatterplot of the relationship between bases on balls and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
ggplot(aes(AB, R)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_line()
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(R_per_game, AB_per_game)) +
geom_point()
Hint: make sure to use the help file (?Teams).
Filter the Teams data frame to include years from 1961 to 2001. Make a scatterplot of runs per game versus at bats (AB) per game.

Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(AB_per_game = AB / G, R_per_game = R / G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Which of the following is true?
Filter the Teams data frame from Question 6. Make a scatterplot of win rate (number of wins per game) versus number of fielding errors (E) per game.

Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(win_rate = W / G, E_per_game = E / G) %>%
ggplot(aes(E_per_game, win_rate)) +
geom_point(alpha = 0.5)
Which of the following is true?
Filter the Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus doubles (X2B) per game.

Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(doubles_per_game = X2B / G, triples_per_game = X3B / G) %>%
ggplot(aes(doubles_per_game, triples_per_game)) +
geom_point(alpha = 0.5)
Which of the following is true?
The corresponding textbook section is Case Study: is height hereditary?
Key points
Code
# create the dataset
if(!require(HistData)) install.packages("HistData")
## Loading required package: HistData
library(HistData)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
# means and standard deviations
galton_heights %>%
summarize(mean(father), sd(father), mean(son), sd(son))
## # A tibble: 1 x 4
## `mean(father)` `sd(father)` `mean(son)` `sd(son)`
## <dbl> <dbl> <dbl> <dbl>
## 1 69.1 2.55 69.2 2.71
# scatterplot of father and son heights
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.5)
The corresponding textbook section is the correlation coefficient
Key points
Code
# the correlation is the average of the products of the standardized values (here x and y are generic vectors)
rho <- mean(scale(x)*scale(y))
# compute the correlation between father and son heights
galton_heights <- GaltonFamilies %>%
filter(childNum == 1 & gender == "male") %>%
select(father, childHeight) %>%
rename(son = childHeight)
galton_heights %>% summarize(r = cor(father, son)) %>% pull(r)
## [1] 0.5007248
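As a quick, hedged illustration (simulated data, not from the course), the average product of standardized values closely matches cor() for large samples:

```r
# Simulated example: correlation as the average product of standardized values.
# Note scale() uses the sample SD (denominator n-1), so the match is approximate.
set.seed(1)
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)
mean(scale(x) * scale(y))  # approximately equal to cor(x, y)
cor(x, y)
```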
The corresponding textbook section is Sample correlation is a random variable
Key points
Code
# compute sample correlation
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
summarize(r = cor(father, son))
R
## r
## 1 0.4787613
# Monte Carlo simulation to show distribution of sample correlation
B <- 1000
N <- 25
R <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
summarize(r = cor(father, son)) %>%
pull(r)
})
qplot(R, geom = "histogram", binwidth = 0.05, color = I("black"))
# expected value and standard error
mean(R)
## [1] 0.4970997
sd(R)
## [1] 0.1512451
# QQ-plot to evaluate whether N is large enough
data.frame(R) %>%
ggplot(aes(sample = R)) +
stat_qq() +
geom_abline(intercept = mean(R), slope = sqrt((1-mean(R)^2)/(N-2)))
Scatterplot of the relationship between x and y
From this figure, the correlation between x and y appears to be about:
Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same?
Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approximately the same?
Filter the Teams data frame to include years from 1961 to 2001. What is the correlation coefficient between number of runs per game and number of at bats per game?
Teams_small <- Teams %>% filter(yearID %in% 1961:2001)
cor(Teams_small$R/Teams_small$G, Teams_small$AB/Teams_small$G)
## [1] 0.6580976
Use the filtered Teams data frame from Question 7. What is the correlation coefficient between win rate (number of wins per game) and number of errors per game?
cor(Teams_small$W/Teams_small$G, Teams_small$E/Teams_small$G)
## [1] -0.3396947
Use the filtered Teams data frame from Question 7. What is the correlation coefficient between doubles (X2B) per game and triples (X3B) per game?
cor(Teams_small$X2B/Teams_small$G, Teams_small$X3B/Teams_small$G)
## [1] -0.01157404
There are three links to relevant sections of the textbook:
Key points
Code
# number of fathers with height 72 or 72.5 inches
sum(galton_heights$father == 72)
## [1] 8
sum(galton_heights$father == 72.5)
## [1] 1
# predicted height of a son with a 72 inch tall father
conditional_avg <- galton_heights %>%
filter(round(father) == 72) %>%
summarize(avg = mean(son)) %>%
pull(avg)
conditional_avg
## [1] 71.83571
# stratify fathers' heights to make a boxplot of son heights
galton_heights %>% mutate(father_strata = factor(round(father))) %>%
ggplot(aes(father_strata, son)) +
geom_boxplot() +
geom_point()
# center of each boxplot
galton_heights %>%
mutate(father = round(father)) %>%
group_by(father) %>%
summarize(son_conditional_avg = mean(son)) %>%
ggplot(aes(father, son_conditional_avg)) +
geom_point()
## `summarise()` ungrouping output (override with `.groups` argument)
# calculate values to plot regression line on original data
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x
b <- mu_y - m*mu_x
# add regression line to plot
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = b, slope = m)
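As a sanity check (an addition to these notes, not part of the course code), lm() recovers the same slope and intercept that were computed above from the means, standard deviations, and correlation:

```r
# lm() least squares estimates should match m (slope) and b (intercept) above
# up to numerical precision, since for simple regression the LSE is exactly
# slope = r*s_y/s_x and intercept = mu_y - slope*mu_x
fit <- lm(son ~ father, data = galton_heights)
fit$coef
c(intercept = b, slope = m)
```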
There is a link to the relevant section of the textbook: Bivariate normal distribution (advanced)
Key points
Code
galton_heights %>%
mutate(z_father = round((father - mean(father)) / sd(father))) %>%
filter(z_father %in% -2:2) %>%
ggplot() +
stat_qq(aes(sample = son)) +
facet_wrap( ~ z_father)
There is a link to the relevant section of the textbook: Variance explained
Key points
There is a link to the relevant section of the textbook: Warning: there are two regression lines
Key point
There are two different regression lines depending on whether we are taking the expectation of Y given X or taking the expectation of X given Y.
Code
# compute a regression line to predict the son's height from the father's height
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1*mu_x
m_1 # slope 1
## [1] 0.5027904
b_1 # intercept 1
## [1] 35.71249
# compute a regression line to predict the father's height from the son's height
m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2*mu_y
m_2 # slope 2
## [1] 0.4986676
b_2 # intercept 2
## [1] 33.96539
Scatterplot and regression line of son and father heights
Given this, what percent of the variation in sons’ heights is explained by fathers’ heights?
When two variables follow a bivariate normal distribution, the variation explained can be calculated as \(\rho^2 \times 100\).
Given a one inch increase in a father’s height, what is the predicted change in the son’s height?
The slope of the regression line is calculated by multiplying the correlation coefficient by the ratio of the standard deviation of son heights and standard deviation of father heights: \(\sigma_{son}/\sigma_{father}\).
In the second part of this assessment, you’ll analyze a set of mother and daughter heights, also from GaltonFamilies.
Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
female_heights <- GaltonFamilies%>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
Mean of mothers’ heights
mean(female_heights$mother)
## [1] 64.125
Standard deviation of mothers’ heights
sd(female_heights$mother)
## [1] 2.289292
Mean of daughters’ heights
mean(female_heights$daughter)
## [1] 64.28011
Standard deviation of daughters’ heights
sd(female_heights$daughter)
## [1] 2.39416
Correlation coefficient
cor(female_heights$mother, female_heights$daughter)
## [1] 0.3245199
Slope of regression line predicting daughters’ height from mothers’ heights
r <- cor(female_heights$mother, female_heights$daughter)
s_y <- sd(female_heights$daughter)
s_x <- sd(female_heights$mother)
r * s_y/s_x
## [1] 0.3393856
Intercept of regression line predicting daughters’ height from mothers’ heights
mu_y <- mean(female_heights$daughter)
mu_x <- mean(female_heights$mother)
mu_y - (r * s_y/s_x)*mu_x
## [1] 42.51701
Change in daughter’s height in inches given a 1 inch increase in the mother’s height
r * s_y/s_x
## [1] 0.3393856
r^2*100
## [1] 10.53132
Suppose a mother has a height of 60 inches. What is the conditional expected value of her daughter's height given the mother's height?
m = r * s_y/s_x
b = mu_y - (r * s_y/s_x)*mu_x
x = 60
m*x+b
## [1] 62.88015
In the Linear Models section, you will learn how to do linear regression.
After completing this section, you will be able to:
You'll also learn to use the do() function to bridge R functions and the tidyverse, and the tidy(), glance(), and augment() functions from the broom package. This section has four parts: Introduction to Linear Models; Least Squares Estimates; Tibbles, do, and broom; and Regression and Baseball.
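As a minimal sketch of the broom functions mentioned above (using the galton_heights data defined in the previous section):

```r
# broom converts model objects into tidy data frames
library(broom)
fit <- lm(son ~ father, data = galton_heights)
tidy(fit)                   # one row per coefficient, with SEs and p-values
tidy(fit, conf.int = TRUE)  # same, with confidence intervals added
glance(fit)                 # one-row model summary (R-squared, sigma, F-statistic)
head(augment(fit))          # original data plus fitted values and residuals
```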
The textbook for this section is available here
Key points
Code
# find regression line for predicting runs from BBs
bb_slope <- Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(BB_per_game = BB/G, R_per_game = R/G) %>%
lm(R_per_game ~ BB_per_game, data = .) %>%
.$coef %>%
.[2]
bb_slope
## BB_per_game
## 0.7353288
# compute regression line for predicting runs from singles
singles_slope <- Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(Singles_per_game = (H-HR-X2B-X3B)/G, R_per_game = R/G) %>%
lm(R_per_game ~ Singles_per_game, data = .) %>%
.$coef %>%
.[2]
singles_slope
## Singles_per_game
## 0.4494253
# calculate correlation between HR, BB and singles
Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(Singles = (H-HR-X2B-X3B)/G, BB = BB/G, HR = HR/G) %>%
summarize(cor(BB, HR), cor(Singles, HR), cor(BB,Singles))
## cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1 0.4039313 -0.1737435 -0.05603822
The textbook for this section is available here
Key points
Code
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
filter(HR_strata >= 0.4 & HR_strata <=1.2)
# scatterplot for each HR stratum
dat %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ HR_strata)
## `geom_smooth()` using formula 'y ~ x'
# calculate slope of regression line after stratifying by HR
dat %>%
group_by(HR_strata) %>%
summarize(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 2
## HR_strata slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.511
## 8 1.1 0.454
## 9 1.2 0.440
# stratify by BB
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_strata = round(BB/G, 1),
HR_per_game = HR / G,
R_per_game = R / G) %>%
filter(BB_strata >= 2.8 & BB_strata <=3.9)
# scatterplot for each BB stratum
dat %>% ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ BB_strata)
## `geom_smooth()` using formula 'y ~ x'
# slope of regression line after stratifying by BB
dat %>%
group_by(BB_strata) %>%
summarize(slope = cor(HR_per_game, R_per_game)*sd(R_per_game)/sd(HR_per_game))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## BB_strata slope
## <dbl> <dbl>
## 1 2.8 1.52
## 2 2.9 1.57
## 3 3 1.52
## 4 3.1 1.49
## 5 3.2 1.58
## 6 3.3 1.56
## 7 3.4 1.48
## 8 3.5 1.63
## 9 3.6 1.83
## 10 3.7 1.45
## 11 3.8 1.70
## 12 3.9 1.30
The textbook for this section is available here
Key points
> lm(son ~ father, data = galton_heights)
Call:
lm(formula = son ~ father, data = galton_heights)
Coefficients:
(Intercept) father
35.71 0.50
Interpret the numeric coefficient for “father.”
galton_heights <- galton_heights %>%
mutate(father_centered=father - mean(father))
We run a linear model using this centered fathers’ height variable.
> lm(son ~ father_centered, data = galton_heights)
Call:
lm(formula = son ~ father_centered, data = galton_heights)
Coefficients:
(Intercept) father_centered
70.45 0.50
Interpret the numeric coefficient for the intercept.
\(E[R|BB=x_1, HR=x_2] = \beta_0+\beta_1x_1+\beta_2x_2\)
Suppose we fix \(BB = x_1\). Then we observe a linear relationship between runs and HR with intercept of:
The textbook for this section is available here
Key points
Code
# compute RSS for any pair of beta0 and beta1 in Galton's data
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
rss <- function(beta0, beta1, data){
resid <- galton_heights$son - (beta0+beta1*galton_heights$father)
return(sum(resid^2))
}
# plot RSS as a function of beta1 when beta0=25
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
results %>% ggplot(aes(beta1, rss)) + geom_line() +
geom_line(aes(beta1, rss), col=2)
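The grid search above fixes beta0; as a hedged sketch (not part of the course code), base R's general-purpose optimizer optim() can minimize RSS over both parameters jointly, and the minimizer should agree with lm():

```r
# Minimize RSS over (beta0, beta1) jointly, starting from a rough guess.
# The data argument of rss() is never used in its body, so it can be omitted.
fit_optim <- optim(c(25, 0.5), function(p) rss(p[1], p[2]))
fit_optim$par                                 # should be close to lm()'s estimates
lm(son ~ father, data = galton_heights)$coef
```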
The textbook for this section is available here
Key points
In the lm() function, the variable that we want to predict is put to the left of the ~ symbol, and the variables that we use to predict are put to the right of the ~ symbol. The intercept is added automatically.

Code
# fit regression line to predict son's height from father's height
fit <- lm(son ~ father, data = galton_heights)
fit
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Coefficients:
## (Intercept) father
## 38.7646 0.4411
# summary statistics
summary(fit)
##
## Call:
## lm(formula = son ~ father, data = galton_heights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4228 -1.7022 0.0333 1.5670 9.3567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.76457 5.41093 7.164 2.03e-11 ***
## father 0.44112 0.07825 5.637 6.72e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.659 on 177 degrees of freedom
## Multiple R-squared: 0.1522, Adjusted R-squared: 0.1474
## F-statistic: 31.78 on 1 and 177 DF, p-value: 6.719e-08
The textbook for this section is available here
Key points
Code
# Monte Carlo simulation
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
.$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])
# Plot the distribution of beta_0 and beta_1
if(!require(gridExtra)) install.packages("gridExtra")
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(gridExtra)
p1 <- lse %>% ggplot(aes(beta_0)) + geom_histogram(binwidth = 5, color = "black")
p2 <- lse %>% ggplot(aes(beta_1)) + geom_histogram(binwidth = 0.1, color = "black")
grid.arrange(p1, p2, ncol = 2)
# summary statistics
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>%
summary %>%
.$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.4729422 8.6021831 4.007464 0.0002129225
## father 0.4990193 0.1240572 4.022493 0.0002030210
lse %>% summarize(se_0 = sd(beta_0), se_1 = sd(beta_1))
## se_0 se_1
## 1 9.683973 0.1411404
Although interpretation is not straightforward, it is also useful to know that the LSE can be strongly correlated, which can be seen using this code:
lse %>% summarize(cor(beta_0, beta_1))
## cor(beta_0, beta_1)
## 1 -0.9993386
However, the correlation depends on how the predictors are defined or transformed.
Here we center the father heights, which changes \(x_i\) to \(x_i - \bar{x}\).
B <- 1000
N <- 50
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
mutate(father = father - mean(father)) %>%
lm(son ~ father, data = .) %>% .$coef
})
Observe what happens to the correlation in this case:
cor(lse[1,], lse[2,])
## [1] 0.1100929
The textbook for this section is available here
Key points
The predict() function in R can give us predictions directly.

Code
# plot predictions and confidence intervals
galton_heights %>% ggplot(aes(father, son)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# predict Y directly
fit <- galton_heights %>% lm(son ~ father, data = .)
Y_hat <- predict(fit, se.fit = TRUE)
names(Y_hat)
## [1] "fit" "se.fit" "df" "residual.scale"
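The components returned by predict() can be combined into confidence intervals by hand; here is a hedged sketch (an addition to the course code) using the fitted values, their standard errors, and the residual degrees of freedom:

```r
# Approximate 95% confidence intervals for the conditional expectation,
# built from the fit, se.fit, and df components returned by predict()
ci <- data.frame(
  father = galton_heights$father,
  fit    = Y_hat$fit,
  lower  = Y_hat$fit - qt(0.975, Y_hat$df) * Y_hat$se.fit,
  upper  = Y_hat$fit + qt(0.975, Y_hat$df) * Y_hat$se.fit
)
head(ci)
```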
# plot best fit line
galton_heights %>%
mutate(Y_hat = predict(lm(son ~ father, data=.))) %>%
ggplot(aes(father, Y_hat))+
geom_line()
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
results %>% ggplot(aes(beta1, rss)) + geom_line() +
geom_line(aes(beta1, rss), col=2)
In a model for sons’ heights vs fathers’ heights, what is the least squares estimate (LSE) for \(\beta_1\) if we assume \(\hat{\beta}_{0}\) is 36? Hint: modify the code above to do your analysis.
# compute RSS for any pair of beta0 and beta1 in Galton's data
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
rss <- function(beta0, beta1, data){
resid <- galton_heights$son - (beta0+beta1*galton_heights$father)
return(sum(resid^2))
}
# plot RSS as a function of beta1 when beta0=36
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 36))
results %>% ggplot(aes(beta1, rss)) + geom_line() +
geom_line(aes(beta1, rss), col=2)
The least squares estimates for the parameters \(\beta_0, \beta_1, \dots, \beta_n\) minimize the residual sum of squares.
Load the Lahman library and filter the Teams data frame to the years 1961-2001.
Run a linear model in R predicting the number of runs per game based on the number of bases on balls and the number of home runs. Remember to first limit your data to 1961-2001.
What is the coefficient for bases on balls?
fit <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(R_per_game = R / G, BB_per_game = BB / G, HR_per_game = HR / G) %>%
lm(R_per_game ~ BB_per_game + HR_per_game, data = .)
summary(fit)
##
## Call:
## lm(formula = R_per_game ~ BB_per_game + HR_per_game, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87325 -0.24507 -0.01449 0.23866 1.24218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.74430 0.08236 21.18 <2e-16 ***
## BB_per_game 0.38742 0.02701 14.34 <2e-16 ***
## HR_per_game 1.56117 0.04896 31.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3484 on 1023 degrees of freedom
## Multiple R-squared: 0.6503, Adjusted R-squared: 0.6496
## F-statistic: 951.2 on 2 and 1023 DF, p-value: < 2.2e-16
The coefficient for bases on balls is 0.39; the coefficient for home runs is 1.56; the intercept is 1.74; the standard error for the BB coefficient is 0.027.
B <- 1000
N <- 100
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>% .$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])
What does the central limit theorem tell us about the variables beta_0 and beta_1?
> mod <- lm(son ~ father, data = galton_heights)
> summary(mod)
Call:
lm(formula = son ~ father, data = galton_heights)
Residuals:
Min 1Q Median 3Q Max
-5.902 -1.405 0.092 1.342 8.092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.7125 4.5174 7.91 2.8e-13 ***
father 0.5028 0.0653 7.70 9.5e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What null hypothesis is the second p-value (the one in the father row) testing?
Select ALL that apply.
galton_heights %>% ggplot(aes(father, son)) +
geom_point() +
geom_smooth()
galton_heights %>% ggplot(aes(father, son)) +
geom_point() +
geom_smooth(method = "lm")
model <- lm(son ~ father, data = galton_heights)
predictions <- predict(model, interval = c("confidence"), level = 0.95)
data <- as_tibble(predictions) %>% bind_cols(father = galton_heights$father)
ggplot(data, aes(x = father, y = fit)) +
geom_line(color = "blue", size = 1) +
geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.2) +
geom_point(data = galton_heights, aes(x = father, y = son))
model <- lm(son ~ father, data = galton_heights)
predictions <- predict(model)
data <- as_tibble(predictions) %>% bind_cols(father = galton_heights$father)
ggplot(data, aes(x = father, y = fit)) +
geom_line(color = "blue", size = 1) +
geom_point(data = galton_heights, aes(x = father, y = son))
In Questions 7 and 8, you’ll look again at female heights from GaltonFamilies.
Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
options(digits = 3) # report 3 significant digits
female_heights <- GaltonFamilies %>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
Fit a linear regression model predicting the mothers' heights using daughters' heights. What is the slope of the model?
fit <- lm(mother ~ daughter, data = female_heights)
fit$coef[2]
## daughter
## 0.31
What is the intercept of the model?
fit$coef[1]
## (Intercept)
## 44.2
What is the predicted height of the first mother in the dataset?
predict(fit)[1]
## 1
## 65.6
What is the actual height of the first mother in the dataset?
female_heights$mother[1]
## [1] 67
We have shown how BB and singles have similar predictive power for scoring runs. Another way to compare the usefulness of these baseball metrics is by assessing how stable they are across the years. Because we have to pick players based on their previous performances, we will prefer metrics that are more stable. In these exercises, we will compare the stability of singles and BBs.
Before we get started, we want to generate two tables: one for 2002 and another for the average of 1999-2001 seasons. We want to define per plate appearance statistics, keeping only players with more than 100 plate appearances. Here is how we create the 2002 table:
bat_02 <- Batting %>% filter(yearID == 2002) %>%
mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
filter(pa >= 100) %>%
select(playerID, singles, bb)
Now compute a similar table for the 1999-2001 seasons, with each player's average singles rate (mean_singles) and average BB rate (mean_bb) over those three seasons. How many players had a singles rate mean_singles of greater than 0.2 per plate appearance over 1999-2001?
bat_99_01 <- Batting %>% filter(yearID %in% 1999:2001) %>%
mutate(pa = AB + BB, singles = (H - X2B - X3B - HR)/pa, bb = BB/pa) %>%
filter(pa >= 100) %>%
group_by(playerID) %>%
summarize(mean_singles = mean(singles), mean_bb = mean(bb))
## `summarise()` ungrouping output (override with `.groups` argument)
sum(bat_99_01$mean_singles > 0.2)
## [1] 46
How many players had a BB rate mean_bb of greater than 0.2 per plate appearance over 1999-2001?
sum(bat_99_01$mean_bb > 0.2)
## [1] 3
Use inner_join() to combine the bat_02 table with the table of 1999-2001 rate averages you created in the previous question. What is the correlation between 2002 singles rates and 1999-2001 average singles rates?
dat <- inner_join(bat_02, bat_99_01)
## Joining, by = "playerID"
cor(dat$singles, dat$mean_singles)
## [1] 0.551
What is the correlation between 2002 BB rates and 1999-2001 average BB rates?
cor(dat$bb, dat$mean_bb)
## [1] 0.717
Make scatterplots of mean_singles versus singles and mean_bb versus bb. Are either of these distributions bivariate normal?
dat %>%
ggplot(aes(singles, mean_singles)) +
geom_point()
dat %>%
ggplot(aes(bb, mean_bb)) +
geom_point()
singles and mean_singles are bivariate normal, but bb and mean_bb are not.
bb and mean_bb are bivariate normal, but singles and mean_singles are not.

Fit a linear model to predict 2002 singles given 1999-2001 mean_singles.

What is the coefficient of mean_singles, the slope of the fit?
fit_singles <- lm(singles ~ mean_singles, data = dat)
fit_singles$coef[2]
## mean_singles
## 0.588
Fit a linear model to predict 2002 bb given 1999-2001 mean_bb.
What is the coefficient of mean_bb, the slope of the fit?
fit_bb <- lm(bb ~ mean_bb, data = dat)
fit_bb$coef[2]
## mean_bb
## 0.829
The lm function does not know how to handle grouped tibbles.
The lm function cannot be put into a tidy format.

You've already written the function get_slope, shown below.
get_slope <- function(data) {
fit <- lm(R ~ BB, data = data)
sum.fit <- summary(fit)
data.frame(slope = sum.fit$coefficients[2, "Estimate"],
se = sum.fit$coefficients[2, "Std. Error"],
pvalue = sum.fit$coefficients[2, "Pr(>|t|)"])
}
What additional code could you write to accomplish your goal?
dat %>%
group_by(HR) %>%
do(get_slope)
dat %>%
group_by(HR) %>%
do(get_slope(.))
dat %>%
group_by(HR) %>%
do(slope = get_slope(.))
dat %>%
do(get_slope(.))
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = HR/G,
R = R/G) %>%
select(lgID, HR, BB, R)
What code would help you quickly answer this question?
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
dat %>%
group_by(lgID) %>%
do(glance(lm(R ~ HR, data = .)))
dat %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
dat %>%
group_by(lgID) %>%
do(mod = lm(R ~ HR, data = .))
lm(R ~ BB + HR)
lm(HR ~ BB + singles + doubles + triples)
lm(R ~ BB + singles + doubles + triples + HR)
lm(R ~ singles + doubles + triples + HR)

Look at the code from the video for a hint:
pa_per_game <- Batting %>%
filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB+BB)/max(G)) %>%
.$pa_per_game %>%
mean
pa_per_game: the mean number of plate appearances per team per game for each team
pa_per_game: the mean number of plate appearances per game for each player
pa_per_game: the number of plate appearances per team per game, averaged across all teams

Which team scores more runs, as predicted by our model?
In the Confounding section, you will learn what is perhaps the most important lesson of statistics: that correlation is not causation.
After completing this section, you will be able to:
This section has one part: Correlation is Not Causation.
The textbook for this section is available here
How many of these correlations would you expect to have a significant p-value (p < 0.05), just by chance?
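A hedged Monte Carlo sketch of the idea (a simulated setup assumed here, not the course's exact code): with completely independent variables, about 5% of groups show a "significant" correlation purely by chance.

```r
# Simulate g independent groups of unrelated (x, y) pairs and count how many
# produce a correlation with p < 0.05 by chance alone.
set.seed(1)
N <- 25    # observations per group
g <- 1000  # number of groups
p_values <- replicate(g, {
  x <- rnorm(N)
  y <- rnorm(N)
  cor.test(x, y)$p.value
})
mean(p_values < 0.05)  # close to 0.05
```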